Sentence Boundary Disambiguation: A User Friendly Approach

Authors

  • Pritam Singh Negi
  • H. S. Dhami
Abstract

In the present work, we develop an algorithm based on maximum entropy and stop-word removal modules. It achieves almost 99% accuracy and outperforms the existing paragraph breaker software developed by the Text Mining Group, School of Computer Science, Manchester University, United Kingdom.

Similar resources

A hybrid approach for Urdu sentence boundary disambiguation

Sentence boundary identification is a preliminary step in preparing a text document for Natural Language Processing tasks, e.g., machine translation, POS tagging, and text summarization. We present a hybrid approach for Urdu sentence boundary disambiguation comprising a unigram statistical model and a rule-based algorithm. After implementing this approach, we obtained 99.48% precision, 86.3...

Periods, Capitalized Words, etc

In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by consi...

An artificial neural network approach for sentence boundary disambiguation in Urdu language text

Sentence boundary identification is an important step for text processing tasks, e.g., machine translation, POS tagging, and text summarization. In this paper, we present an approach comprising a Feed Forward Neural Network (FFNN) along with part-of-speech information for the words in a corpus. The proposed adaptive system has been tested after training it with varying sizes of data and threshold ...

Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation

Urdu is a morphologically rich language whose characters differ in nature. Urdu text tokenization and sentence boundary disambiguation are difficult compared to languages like English. A major hurdle for tokenization is the improper use of spaces between words, whereas the absence of case discrimination makes sentence boundary detection a difficult task. In this paper some issues regarding b...

Automatic Sentence Break Disambiguation for Thai

Unlike English, Thai has no explicit sentence marker. Conventionally, a space is placed at the end of a sentence in written Thai, but a space does not always indicate a sentence boundary. In this paper, we propose a feature-based algorithm that extracts sentences from a paragraph by detecting the appropriate sentence-breaking spaces. ...


Publication date: 2010